Results 1 - 20 of 101,080
1.
Genome Biol ; 25(1): 83, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38566111

ABSTRACT

BACKGROUND: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. RESULTS: Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. CONCLUSIONS: Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.


Subject(s)
DNA , Nucleic Acid Regulatory Sequences , Binding Sites , Sequence Alignment , Algorithms , Conserved Sequence/genetics , Molecular Evolution
2.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600663

ABSTRACT

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in the protein structure prediction, we explored whether the representation of structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, they are further fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms the state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on CATH 4.2 benchmark, respectively. Encouraging results also have been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can well reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.


Subject(s)
Computer Neural Networks , Proteins , Sequence Alignment , Amino Acid Sequence , Proteins/chemistry , Protein Sequence Analysis/methods
3.
PLoS Comput Biol ; 20(4): e1011988, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38557416

ABSTRACT

Accurate multiple sequence alignment (MSA) is imperative for the comprehensive analysis of biological sequences. However, a notable challenge arises as no single MSA tool consistently outperforms its counterparts across diverse datasets. Users often have to try multiple MSA tools to achieve optimal alignment results, which can be time-consuming and memory-intensive. While the overall accuracy of certain MSA results may be lower, there could be local regions with the highest alignment scores, prompting researchers to seek a tool capable of merging these locally optimal results from multiple initial alignments into a globally optimal alignment. In this study, we introduce Two Pointers Meta-Alignment (TPMA), a novel tool designed for the integration of nucleic acid sequence alignments. TPMA employs two pointers to partition the initial alignments into blocks containing identical sequence fragments. It selects blocks with the high sum of pairs (SP) scores to concatenate them into an alignment with an overall SP score superior to that of the initial alignments. Through tests on simulated and real datasets, the experimental results consistently demonstrate that TPMA outperforms M-Coffee in terms of aSP, Q, and total column (TC) scores across most datasets. Even in cases where TPMA's scores are comparable to M-Coffee, TPMA exhibits significantly lower running time and memory consumption. Furthermore, we comprehensively assessed all the MSA tools used in the experiments, considering accuracy, time, and memory consumption. We propose accurate and fast combination strategies for small and large datasets, which streamline the user tool selection process and facilitate large-scale dataset integration. The dataset and source code of TPMA are available on GitHub (https://github.com/malabz/TPMA).
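
The sum-of-pairs (SP) score that TPMA uses to rank alignment blocks can be illustrated with a minimal sketch. The scoring values below (match, mismatch, gap) and the function name are assumptions chosen for illustration; they are not taken from TPMA, whose exact scoring scheme is not given in the abstract.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows.

    The default scores are illustrative assumptions, not TPMA's values.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    # Walk the alignment column by column and score every pair of rows.
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

Under this scheme, a block taken from one initial alignment would be preferred over the corresponding block from another whenever its SP score is higher.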


Subject(s)
Algorithms , Nucleic Acids , Sequence Alignment , Coffee , Software
4.
PLoS One ; 19(4): e0298164, 2024.
Article in English | MEDLINE | ID: mdl-38574063

ABSTRACT

SARS-CoV-2, the causative agent of COVID-19, is known to exhibit secondary structures in its 5' and 3' untranslated regions, along with the frameshifting stimulatory element situated between ORF1a and 1b. To identify additional regions containing conserved structures, we utilized a multiple sequence alignment with related coronaviruses as a starting point. We applied a computational pipeline developed for identifying non-coding RNA elements. Our pipeline employed three different RNA structural prediction approaches. We identified forty genomic regions likely to harbor structures, with ten of them showing three-way consensus substructure predictions among our predictive utilities. We conducted intracomparisons of the predictive utilities within the pipeline and intercomparisons with four previously published SARS-CoV-2 structural datasets. While there was limited agreement on the precise structure, different approaches seemed to converge on regions likely to contain structures in the viral genome. By comparing and combining various computational approaches, we can predict regions most likely to form structures, as well as a probable structure or ensemble of structures. These predictions can be used to guide surveillance, prophylactic measures, or therapeutic efforts. Data and scripts employed in this study may be found at https://doi.org/10.5281/zenodo.8298680.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , COVID-19/genetics , Sequence Alignment , Viral Genome/genetics , Viral RNA/genetics , Viral RNA/chemistry
5.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38485699

ABSTRACT

MOTIVATION: Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS: Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION: Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.
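
Searching bisulfite-converted reads needs a dedicated strategy because unmethylated cytosines are read as thymines. One common approach, sketched below, is to reduce both query and reference to a three-letter alphabet before comparison. This is a generic illustration with a hypothetical function name; the abstract does not describe how Lambda3 itself handles the conversion.

```python
def bisulfite_convert(seq, strand="forward"):
    """In silico bisulfite conversion of a nucleotide sequence.

    forward: every C becomes T (as seen on the converted strand)
    reverse: every G becomes A (the same change seen from the opposite strand)
    Case is preserved; all other characters are left untouched.
    """
    if strand == "forward":
        table = str.maketrans("Cc", "Tt")
    elif strand == "reverse":
        table = str.maketrans("Gg", "Aa")
    else:
        raise ValueError("strand must be 'forward' or 'reverse'")
    return seq.translate(table)
```

Applying the same conversion to the reference and to the reads makes methylated and unmethylated positions compare as equal during the search; the methylation state can then be recovered afterwards from the unconverted read.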


Subject(s)
Algorithms , Software , Sulfites , Sequence Alignment , Proteins
6.
Sci Rep ; 14(1): 6009, 2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38472223

ABSTRACT

Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates-thus reducing false positives as well as computation time.


Subject(s)
Algorithms , Proteins , Sequence Alignment , Proteins/genetics , Biological Evolution , Phylogeny , Computational Biology/methods
7.
Genes (Basel) ; 15(3), 2024 Mar 07.
Article in English | MEDLINE | ID: mdl-38540400

ABSTRACT

Bioinformatics is a rapidly developing field enabling scientific experiments via computer models and simulations. In recent years, there has been an extraordinary growth in biological databases. Therefore, it is extremely important to propose effective methods and algorithms for the fast and accurate processing of biological data. Sequence comparisons are the best way to investigate and understand the biological functions and evolutionary relationships between genes on the basis of the alignment of two or more DNA sequences in order to maximize the identity level and degree of similarity. This paper presents a new version of the pairwise DNA sequence alignment algorithm, based on a new method called CAT, where a dependency with a previous match and the closest neighbor are taken into consideration to increase the uniqueness of the CAT profile and to reduce possible collisions, i.e., two or more sequences with the same CAT profile. This makes the proposed algorithm suitable for finding the exact match of a concrete DNA sequence in a large set of DNA data faster. In order to enable the usage of the profiles as sequence metadata, CAT profiles are generated once prior to data uploading to the database. The proposed algorithm consists of two main stages: CAT profile calculation depending on the chosen benchmark sequences and sequence comparison by using the calculated CAT profiles. Improvements in the generation of the CAT profiles are detailed and described in this paper. Block schemes, pseudocode tables, and figures were updated according to the proposed new version and experimental results. Experiments were carried out using the new version of the CAT method for DNA sequence alignment and different datasets. New experimental results regarding collisions, speed, and efficiency of the suggested new implementation are presented. Experiments related to the performance comparison with Needleman-Wunsch were re-executed with the new version of the algorithm to confirm that we have the same performance. A performance analysis of the proposed algorithm based on the CAT method against the Knuth-Morris-Pratt algorithm, which has a complexity of O(n) and is widely used for biological data searching, was performed. The impact of prior matching dependencies on uniqueness for generated CAT profiles is investigated. The experimental results from sequence alignment demonstrate that the proposed CAT method-based algorithm exhibits minimal deviation, which can be deemed negligible if such deviation is considered permissible in favor of enhanced performance. It should be noted that the performance of the CAT algorithm in terms of execution time remains stable, unaffected by the length of the analyzed sequences. Hence, the primary benefit of the suggested approach lies in its rapid processing capabilities in large-scale sequence alignment, a task that traditional exact algorithms would require significantly more time to perform.
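
For reference, the Knuth-Morris-Pratt baseline mentioned in this abstract can be sketched as follows. This is the textbook algorithm, not code from the paper, and the function name is chosen here for illustration. It finds all exact occurrences of a pattern in time linear in the combined length of text and pattern.

```python
def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once, never moving backwards in it.
    hits = []
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return hits
```

For example, `kmp_search("ACGTACGTAC", "ACGTAC")` reports the two overlapping occurrences at positions 0 and 4.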


Subject(s)
Algorithms , DNA , Base Sequence , Sequence Alignment , Computer Simulation , DNA/genetics
8.
Int J Mol Sci ; 25(6), 2024 Mar 16.
Article in English | MEDLINE | ID: mdl-38542339

ABSTRACT

Myosins, a superfamily of motor proteins, obtain the energy they require for movement from ATP hydrolysis and perform various functions by binding to actin filaments. Extensive studies have clarified the diverse functions performed by the different isoforms of myosin. However, the unavailability of resolved structures has made it difficult to understand the way in which their mechanochemical cycle and structural diversity give rise to distinct functional properties. With this study, we seek to further our understanding of the structural organization of the myosin 7A motor domain by modeling the tertiary structure of myosin 7A based on its primary sequence. Multiple sequence alignment and a comparison of the models of different myosin isoforms and myosin 7A not only enabled us to identify highly conserved nucleotide binding sites but also to predict actin binding sites. In addition, the actomyosin-7A complex was predicted from the protein-protein interaction model, from which the core interface sites of actin and the myosin 7A motor domain were defined. Finally, sequence alignment and the comparison of models were used to suggest the possibility of a pliant region existing between the converter domain and lever arm of myosin 7A. The results of this study provide insights into the structure of myosin 7A that could serve as a framework for higher resolution studies in the future.


Subject(s)
Actins , Myosins , Actins/metabolism , Sequence Alignment , Tertiary Protein Structure , Myosins/metabolism , Protein Binding , Protein Isoforms/metabolism , Adenosine Triphosphate/metabolism
9.
Microbes Environ ; 39(1), 2024.
Article in English | MEDLINE | ID: mdl-38508742

ABSTRACT

With the explosion of available genomic information, comparative genomics has become a central approach to understanding microbial ecology and evolution. We developed DiGAlign (https://www.genome.jp/digalign/), a web server that provides versatile functionality for comparative genomics with an intuitive interface. It allows the user to perform the highly customizable visualization of a synteny map by simply uploading nucleotide sequences of interest, ranging from a specific region to the whole genome landscape of microorganisms and viruses. DiGAlign will serve a wide range of biological researchers, particularly experimental biologists, with multifaceted features that allow the rapid characterization of genomic sequences of interest and the generation of a publication-ready figure.


Subject(s)
Software , User-Computer Interface , Sequence Alignment , Genomics , Genome
10.
Arch Virol ; 169(4): 79, 2024 Mar 22.
Article in English | MEDLINE | ID: mdl-38519762

ABSTRACT

A novel double-strand RNA (dsRNA) mycovirus, named "Colletotrichum fioriniae alternavirus1" (CfAV1), was isolated from the strain CX7 of Colletotrichum fioriniae, the causal agent of walnut anthracnose. The complete genome of CfAV1 is composed of three dsRNA segments: dsRNA1 (3528 bp), dsRNA2 (2485 bp), and dsRNA3 (2481 bp). The RNA-dependent RNA polymerase (RdRp) is encoded by dsRNA1, while both dsRNA2 and dsRNA3 encode hypothetical proteins. Based on multiple sequence alignments and phylogenetic analysis, CfAV1 is identified as a new member of the family Alternaviridae. This is the first report of an alternavirus that infects the phytopathogenic fungus C. fioriniae.


Subject(s)
Colletotrichum , Fungal Viruses , RNA Viruses , Phylogeny , Viral Genome , Colletotrichum/genetics , Sequence Alignment , Double-Stranded RNA/genetics , Viral RNA/genetics , Open Reading Frames
11.
Int J Biol Macromol ; 264(Pt 2): 130739, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38460639

ABSTRACT

Extradiol dioxygenases (EDOs) catalyzing meta-cleavage of catecholic compounds promise an effective way to detoxify aromatic pollutants. This work reports a novel strategy to engineer our recently identified Type I EDO from Tcu3516 for a broader substrate scope and enhanced activity, based on 2,3-dihydroxybiphenyl (2,3-DHB)-liganded molecular docking of Tcu3516 and multiple sequence alignment with 22 other Type I EDOs. Eleven non-conserved residues of Tcu3516 within 6 Å of the 2,3-DHB ligand center were selected as potential hotspots and subjected to semi-rational design using 6 catecholic analogues as substrates; the mutants V186L and V212N emerged with a broadened substrate scope and improved catalytic activity. Both mutants were combined with D285A to construct double mutants and the final triple mutant V186L/V212N/D285A. Except for 2,3-DHB (for which the mutant V186L/D285A gave the best catalytic performance), the triple mutant was the most effective at degrading the other 5 catecholic compounds, with a 10- to 30-fold increase in catalytic efficiency (kcat/Km), a 15 °C increase in protein Tm (structural rigidity), and a 10-fold increase in half-life compared to the wild-type Tcu3516. Molecular dynamics simulation suggested that a more stable core and a more flexible entrance likely account for the enhanced catalytic activity and stability of the enzymes.


Subject(s)
Organic Chemicals , Oxygenases , Molecular Docking Simulation , Oxygenases/chemistry , Sequence Alignment , Substrate Specificity
12.
Gene ; 911: 148338, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38438056

ABSTRACT

DAX1 (dosage-sensitive sex reversal, adrenal hypoplasia congenita critical region on the X chromosome, gene 1), a key sex determinant in various species, plays a vital role in gonad differentiation and development and controls spermatogenesis. However, the identity and function of DAX1 are still unclear in bivalves. In the present study, we identified a DAX1 gene (designated Tc-DAX1) from the boring giant clam Tridacna crocea, a tropical marine bivalve. The full length of Tc-DAX1 was 1877 bp, encoding 462 amino acids, with a molecular weight of 51.81 kDa and a theoretical isoelectric point (pI) of 5.87. Multiple sequence alignments and phylogenetic analysis indicated a putative ligand binding domain (LBD) with conserved regions and clustering with molluscan DAX1 homologs. The tissue distributions in different reproductive stages revealed a dimorphic pattern, with the highest expression trend in the male reproductive stage, indicating its role in spermatogenesis. The DAX1 expression data from embryonic stages show its highest expression profile (P < 0.05) in the zygote stage, followed by decreasing trends in the larval stages (P > 0.05). The localization of DAX1 transcripts has also been confirmed by whole mount in situ hybridization, showing high positive signals in the fertilized egg, the 2- and 4-cell stages, and the gastrula. Moreover, RNAi knockdown of the Tc-DAX1 transcripts shows a significantly lower expression profile in the ds-DAX1 group compared to the ds-EGFP group. Subsequent histological analysis of gonads revealed that spermatogenesis was affected in the ds-DAX1 group compared to the ds-EGFP group. All these results indicate that Tc-DAX1 is involved in the spermatogenesis and early embryonic development of T. crocea, providing valuable information for the breeding and aquaculture of giant clams.


Subject(s)
Bivalvia , Gonads , Male , Animals , Phylogeny , Gonads/metabolism , Spermatogenesis/genetics , Sequence Alignment , Bivalvia/genetics , DAX-1 Orphan Nuclear Receptor/genetics , DAX-1 Orphan Nuclear Receptor/metabolism
13.
Elife ; 12, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38488154

ABSTRACT

Accurately detecting distant evolutionary relationships between proteins remains an ongoing challenge in bioinformatics. Search methods based on primary sequence struggle to accurately detect homology between sequences with less than 20% amino acid identity. Profile- and structure-based strategies extend sensitive search capabilities into this twilight zone of sequence similarity but require slow pre-processing steps. Recently, whole-protein and positional embeddings from deep neural networks have shown promise for providing sensitive sequence comparison and annotation at long evolutionary distances. Embeddings are generally faster to compute than profiles and predicted structures but still suffer several drawbacks related to the ability of whole-protein embeddings to discriminate domain-level homology, and the database size and search speed of methods using positional embeddings. In this work, we show that low-dimensionality positional embeddings can be used directly in speed-optimized local search algorithms. As a proof of concept, we use the ESM2 3B model to convert primary sequences directly into the 3D interaction (3Di) alphabet or amino acid profiles and use these embeddings as input to the highly optimized Foldseek, HMMER3, and HH-suite search algorithms. Our results suggest that positional embeddings as small as a single byte can provide sufficient information for dramatically improved sensitivity over amino acid sequence searches without sacrificing search speed.


Subject(s)
Algorithms , Proteins , Sequence Alignment , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Computational Biology/methods , Amino Acids
14.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38532297

ABSTRACT

MOTIVATION: Computational methods to detect correlated amino acid positions in proteins have become a valuable tool to predict intra- and inter-residue protein contacts, protein structures, and effects of mutation on protein stability and function. While there are many tools and webservers to compute coevolution scoring matrices, there is no central repository of alignments and coevolution matrices for large-scale studies and pattern detection leveraging on biological and structural annotations already available in UniProt. RESULTS: We present a Python library, PyCoM, which enables users to query and analyze coevolution matrices and sequence alignments of 457 622 proteins, selected from UniProtKB/Swiss-Prot database (length ≤ 500 residues), from a precompiled coevolution matrix database (PyCoMdb). PyCoM facilitates the development of statistical analyses of residue coevolution patterns using filters on biological and structural annotations from UniProtKB/Swiss-Prot, with simple access to PyCoMdb for both novice and advanced users, supporting Jupyter Notebooks, Python scripts, and a web API access. The resource is open source and will help in generating data-driven computational models and methods to study and understand protein structures, stability, function, and design. AVAILABILITY AND IMPLEMENTATION: PyCoM code is freely available from https://github.com/scdantu/pycom and PyCoMdb and the Jupyter Notebook tutorials are freely available from https://pycom.brunel.ac.uk.


Subject(s)
Proteins , Software , Proteins/chemistry , Sequence Alignment , Amino Acids , Protein Databases
15.
J Math Biol ; 88(5): 50, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38551701

ABSTRACT

Network alignment aims to uncover topologically similar regions in the protein-protein interaction (PPI) networks of two or more species under the assumption that topologically similar regions tend to perform similar functions. Although there exist a plethora of both network alignment algorithms and measures of topological similarity, currently no "gold standard" exists for evaluating how well either is able to uncover functionally similar regions. Here we propose a formal, mathematically and statistically rigorous method for evaluating the statistical significance of shared GO terms in a global, 1-to-1 alignment between two PPI networks. Given an alignment in which k aligned protein pairs share a particular GO term g, we use a combinatorial argument to precisely quantify the p-value of that alignment with respect to g compared to a random alignment. The p-value of the alignment with respect to all GO terms, including their inter-relationships, is approximated using the Empirical Brown's Method. We note that, just as with BLAST's p-values, this method is not designed to guide an alignment algorithm towards a solution; instead, just as with BLAST, an alignment is guided by a scoring matrix or function; the p-values herein are computed after the fact, providing independent feedback to the user on the biological quality of the alignment that was generated by optimizing the scoring function. Importantly, we demonstrate that among all GO-based measures of network alignments, ours is the only one that correlates with the precision of GO annotation predictions, paving the way for network alignment-based protein function prediction.
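
To give a feel for the kind of combinatorial p-value described here, the sketch below uses a deliberately simplified model, which is an assumption and not the paper's derivation: every node of the smaller network is aligned to a distinct, uniformly random node of the larger one. Under that model, the number of aligned pairs in which both proteins carry GO term g follows a hypergeometric distribution. The function and parameter names are hypothetical.

```python
from math import comb


def shared_term_p_value(n2, lam1, lam2, k):
    """P(at least k aligned pairs share term g) under a random 1-to-1 alignment.

    Simplified model (an assumption, not the published method):
      n2   -- number of nodes in the larger network
      lam1 -- nodes annotated with g in the smaller network
      lam2 -- nodes annotated with g in the larger network
      k    -- aligned pairs observed to share g
    The lam1 annotated nodes land on lam1 distinct random targets, of which
    lam2 are annotated, so the count of shared pairs is hypergeometric.
    """
    if lam1 > n2 or lam2 > n2 or min(n2, lam1, lam2, k) < 0:
        raise ValueError("need 0 <= lam1, lam2 <= n2 and k >= 0")
    upper = min(lam1, lam2)
    tail = sum(
        comb(lam2, j) * comb(n2 - lam2, lam1 - j) for j in range(k, upper + 1)
    )
    return tail / comb(n2, lam1)
```

A small p-value for a given term means the observed number of shared annotations would be unlikely under a random alignment. Combining the p-values of many correlated GO terms, which the abstract handles with the Empirical Brown's Method, is outside the scope of this sketch.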


Subject(s)
Algorithms , Computational Biology , Gene Ontology , Computational Biology/methods , Sequence Alignment , Protein Interaction Maps , Proteins/genetics
16.
PeerJ ; 12: e16890, 2024.
Article in English | MEDLINE | ID: mdl-38464752

ABSTRACT

Despite millions of SARS-CoV-2 genomes being sequenced and shared globally, manipulating such data sets is still challenging, especially selecting sequences for focused phylogenetic analysis. We present a novel method, uvaia, which is based on partial and exact sequence similarity for quickly extracting database sequences similar to query sequences of interest. Many SARS-CoV-2 phylogenetic analyses rely on very low numbers of ambiguous sites as a measure of quality since ambiguous sites do not contribute to single nucleotide polymorphism (SNP) differences. Uvaia overcomes this limitation by using measures of sequence similarity which consider partially ambiguous sites, allowing for more ambiguous sequences to be included in the analysis if needed. Such fine-grained definition of similarity allows not only for better phylogenetic analyses, but could also lead to improved classification and biogeographical inferences. Uvaia works natively with compressed files, can use multiple cores and efficiently utilises memory, being able to analyse large data sets on a standard desktop.
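
One way to let partially ambiguous sites contribute to a similarity measure is to treat each IUPAC code as a set of possible bases and count a site as compatible when the two sets overlap. The sketch below illustrates that idea only; it is an assumption about the general approach, not uvaia's actual distance definitions, and the function name is hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible_sites(a, b):
    """Count aligned sites whose IUPAC base sets overlap.

    Characters outside the table (for example gaps) match nothing.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    count = 0
    for x, y in zip(a.upper(), b.upper()):
        if set(IUPAC.get(x, "")) & set(IUPAC.get(y, "")):
            count += 1
    return count
```

With this rule, an `R` (A or G) in one sequence is compatible with a `G` in the other, whereas a strict comparison would either count it as a mismatch or discard the site.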


Subject(s)
Computers , SARS-CoV-2 , Phylogeny , Sequence Alignment , SARS-CoV-2/genetics
17.
Proc Natl Acad Sci U S A ; 121(13): e2308788121, 2024 Mar 26.
Article in English | MEDLINE | ID: mdl-38507445

ABSTRACT

Protein structure prediction has been greatly improved by deep learning in the past few years. However, the most successful methods rely on multiple sequence alignment (MSA) of the sequence homologs of the protein under prediction. In nature, a protein folds in the absence of its sequence homologs and thus, a MSA-free structure prediction method is desired. Here, we develop a single-sequence-based protein structure prediction method RaptorX-Single by integrating several protein language models and a structure generation module and then study its advantage over MSA-based methods. Our experimental results indicate that in addition to running much faster than MSA-based methods such as AlphaFold2, RaptorX-Single outperforms AlphaFold2 and other MSA-free methods in predicting the structure of antibodies (after fine-tuning on antibody data), proteins of very few sequence homologs, and single mutation effects. By comparing different protein language models, our results show that not only the scale but also the training data of protein language models will impact the performance. RaptorX-Single also compares favorably to MSA-based AlphaFold2 when the protein under prediction has a large number of sequence homologs.


Subject(s)
Antibodies , Proteins , Proteins/genetics , Proteins/chemistry , Antibodies/genetics , Sequence Alignment , Algorithms
18.
Nat Commun ; 15(1): 2464, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38538622

ABSTRACT

This paper presents an innovative approach for predicting the relative populations of protein conformations using AlphaFold 2, an AI-powered method that has revolutionized biology by enabling the accurate prediction of protein structures. While AlphaFold 2 has shown exceptional accuracy and speed, it is designed to predict proteins' ground state conformations and is limited in its ability to predict conformational landscapes. Here, we demonstrate how AlphaFold 2 can directly predict the relative populations of different protein conformations by subsampling multiple sequence alignments. We tested our method against nuclear magnetic resonance experiments on two proteins with drastically different amounts of available sequence data, Abl1 kinase and the granulocyte-macrophage colony-stimulating factor, and predicted changes in their relative state populations with more than 80% accuracy. Our subsampling approach worked best when used to qualitatively predict the effects of mutations or evolution on the conformational landscape and well-populated states of proteins. It thus offers a fast and cost-effective way to predict the relative populations of protein conformations at even single-point mutation resolution, making it a useful tool for pharmacology, analysis of experimental results, and predicting evolution.


Subject(s)
Point Mutation , Protein Conformation , Mutation , Sequence Alignment
19.
Int J Mol Sci ; 25(4), 2024 Feb 08.
Article in English | MEDLINE | ID: mdl-38396751

ABSTRACT

Chitin deacetylase (CDA) can catalyze the deacetylation of chitin to produce chitosan. In this study, we identified and characterized a chitin deacetylase gene from Euphausia superba (EsCDA-9k), and a soluble recombinant protein chitin deacetylase from Euphausia superba of molecular weight 45 kDa was cloned, expressed, and purified. The full-length cDNA sequence of EsCDA-9k was 1068 bp long and encoded 355 amino acid residues that contained the typical domain structure of carbohydrate esterase family 4. The predicted three-dimensional structure of EsCDA-9k showed a 67.32% homology with Penaeus monodon. Recombinant chitin deacetylase had the highest activity at 40 °C and pH 8.0 in Tris-HCl buffer. The enzyme activity was enhanced by metal ions Co2+, Fe3+, Ca2+, and Na+, while it was inhibited by Zn2+, Ba2+, Mg2+, and EDTA. Molecular simulation of EsCDA-9k was conducted based on sequence alignment and homology modeling. The EsCDA-9k F18G mutant showed a 1.6-fold higher activity than the wild-type enzyme. In summary, this is the first report of the cloning and heterologous expression of the chitin deacetylase gene in Euphausia superba. The characterization and function study of EsCDA-9k will serve as an important reference point for future application.


Subject(s)
Euphausiacea , Animals , Molecular Cloning , Sequence Alignment , Recombinant Proteins/genetics , Recombinant Proteins/metabolism , Amidohydrolases/metabolism , Chitin
20.
Genes (Basel) ; 15(2)2024 Feb 10.
Article in English | MEDLINE | ID: mdl-38397217

ABSTRACT

Different species of toothed whales (Odontoceti) exhibit a variety of tooth forms and enamel types. Some odontocetes have highly prismatic enamel with Hunter-Schreger bands, whereas enamel is vestigial or entirely lacking in other species. Different tooth forms and enamel types are associated with alternate feeding strategies that range from biting and grasping prey with teeth in most oceanic and river dolphins to the suction feeding of softer prey items without the use of teeth in many beaked whales. At the molecular level, previous studies have documented inactivating mutations in the enamel-specific genes of some odontocete species that lack complex enamel. At a broader scale, however, it is unclear whether enamel complexity across the full diversity of extant Odontoceti correlates with the relative strength of purifying selection on enamel-specific genes. Here, we employ sequence alignments for seven enamel-specific genes (ACP4, AMBN, AMELX, AMTN, ENAM, KLK4, MMP20) in 62 odontocete species that are representative of all extant families. The sequences for 33 odontocete species were obtained from databases, and sequences for the remaining 29 species were newly generated for this study. We screened these alignments for inactivating mutations (e.g., frameshift indels) and provide a comprehensive catalog of these mutations in species with one or more inactivated enamel genes. Inactivating mutations are rare in Delphinidae (oceanic dolphins) and Platanistidae/Inioidea (river dolphins) that have higher enamel complexity scores. By contrast, mutations are much more numerous in clades such as Monodontidae (narwhal, beluga), Ziphiidae (beaked whales), Physeteroidea (sperm whales), and Phocoenidae (porpoises) that are characterized by simpler enamel or even enamelless teeth. 
Further, several higher-level taxa (e.g., Hyperoodon, Kogiidae, Monodontidae) possess shared inactivating mutations in one or more enamel genes, which suggests loss of function of these genes in the common ancestor of each clade. We also performed selection (dN/dS) analyses on a concatenation of these genes and used linear regression and Spearman's rank-order correlation to test for correlations between enamel complexity and two different measures of selection intensity (number of inactivating mutations per million years and dN/dS values). Selection analyses revealed that relaxed purifying selection is especially prominent in physeteroids, monodontids, and phocoenids. Linear regressions and correlation analyses revealed a strong negative correlation between selective pressure (dN/dS values) and enamel complexity. Stronger purifying selection (low dN/dS) is found on branches with more complex enamel, and weaker purifying selection (higher dN/dS) occurs on branches with less complex enamel or enamelless teeth. As odontocetes diversified into a variety of feeding modes, in particular the suction capture of prey, a reduced reliance on the dentition for prey capture resulted in relaxed selection on genes that are critical to enamel development.
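As a rough illustration of the rank-correlation test mentioned in this abstract, the following sketch computes Spearman's rank-order correlation in plain Python, using average ranks for ties. The function names and the input values are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only; not data from the study.
enamel_complexity = [4, 3, 3, 2, 1, 0]
dn_ds = [0.21, 0.30, 0.28, 0.55, 0.80, 0.95]
rho = spearman(enamel_complexity, dn_ds)  # negative for these toy values
```

A negative rho in this setup would correspond to the pattern the abstract reports: branches with more complex enamel tend to have lower dN/dS.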


Subject(s)
Dolphins , Whales , Humans , Animals , Phylogeny , Whales/genetics , Dolphins/genetics , Sequence Alignment , Dental Enamel